Tokenization of Tunisian Arabic: a comparison between three Machine Learning models

Abstract

Tokenization is the process of segmenting a piece of text into smaller units called tokens. Since Arabic is an agglutinating language by nature, this treatment becomes a crucial preprocessing step for many Natural Language Processing (NLP) applications such as morphological analysis, parsing, machine translation, and information extraction. In this article, we investigate the word tokenization task with a rewriting process to rewrite the orthography of the stem. For this task, we use Tunisian Arabic (TA) text. To the best of the researchers’ knowledge, this is the first study that uses TA for tokenization. Therefore, we start by collecting and preparing various corpora from different sources. Then, we present a comparison of three character-based tokenizers based on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The proposed CRF model achieved an F-measure of 88.9%.

Similar resources

Arabic Tokenization System

Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-rounded process with a preprocessing stage (white space normalizer), and a ...

Bayesian Learning of Tokenization for Machine Translation

Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creates a challenge for tokenization. Morphologically rich languages such as Korean and Hungarian present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often ...

Thermal conductivity of Water-based nanofluids: Prediction and comparison of models using machine learning

Statistical methods, and especially machine learning, have been increasingly used in nanofluid modeling. This paper presents some of the interesting and applicable methods for thermal conductivity prediction and compares them with each other according to results and errors that are defined. The thermal conductivity of nanofluids increases with the volume fraction and temperature. Machine learni...

A Conventional Orthography for Tunisian Arabic

Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum ...

Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2023

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3599234